Towards Data-Driven Autonomics in Data Centers
Continued reliance on human operators for managing data centers is a major
impediment to their ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using generated
data, opens one possible path towards limiting the role of operators in data
centers. In this paper, we present a data-science study of a public Google
dataset collected in a 12K-node cluster with the goal of building and
evaluating a predictive model for node failures. We use BigQuery, the big data
SQL platform from the Google Cloud suite, to process massive amounts of data
and generate a rich feature set characterizing machine state over time. We
describe how an ensemble classifier can be built out of many Random Forest
classifiers each trained on these features, to predict if machines will fail in
a future 24-hour window. Our evaluation reveals that if we limit false positive
rates to 5%, we can achieve true positive rates between 27% and 88% with
precision varying between 50% and 72%. We discuss the practicality of including
our predictive model as the central component of a data-driven autonomic
manager and operating it on-line with live data streams (rather than off-line
on data logs). All of the scripts used for BigQuery and classification analyses
are publicly available from the authors' website. Comment: 12 pages, 6 figures
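The evaluation criterion described in this abstract (true positive rate and precision at a capped false positive rate) can be sketched as follows. This is a minimal illustration, not the authors' published scripts: the function name `rates_at_fpr_cap`, its parameters, and the input format (one failure score and one 0/1 label per machine) are assumptions made for the example.

```python
def rates_at_fpr_cap(scores, labels, max_fpr=0.05):
    """Find the most lenient score threshold whose false positive rate
    stays within max_fpr (hypothetical helper, for illustration only).

    scores: predicted failure scores, higher means more likely to fail.
    labels: 1 if the machine failed in the prediction window, else 0.
    Returns (threshold, tpr, fpr, precision), or None if even the
    strictest threshold exceeds the cap.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    best = None
    # Try each distinct score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fpr = fp / negatives
        if fpr > max_fpr:
            break  # lowering the threshold further can only raise the FPR
        tpr = tp / positives
        precision = tp / (tp + fp)  # tp + fp >= 1: the threshold is an observed score
        best = (threshold, tpr, fpr, precision)
    return best
```

In practice the scores would come from the ensemble of Random Forest classifiers; here they are simply taken as given.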
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment to their ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%. This level of performance allows us to recover a
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...
Gene regulatory network modelling with evolutionary algorithms - an integrative approach
Building models for gene regulation has been an important aim of Systems Biology over the past years, driven by the large amount of gene expression data that has become available. Models represent regulatory interactions between genes and transcription factors and can provide better understanding of biological processes, and means of simulating both natural and perturbed systems (e.g. those associated with disease). Gene regulatory network
(GRN) quantitative modelling is still limited, however, due to data issues such as noise and restricted length of time series, typically used for GRN reverse engineering. These issues create an under-determination problem, with many models possibly fitting the data. However,
large amounts of other types of biological data and knowledge are available, such as cross-platform measurements, knockout experiments, annotations, binding site affinities for transcription factors and so on. It has been postulated that integration of these can improve
model quality obtained, by facilitating further filtering of possible models. However, integration is not straightforward, as the different types of data can provide contradictory information, and are intrinsically noisy, hence large scale integration has not been fully
explored, to date. Here, we present an integrative parallel framework for GRN modelling, which employs
evolutionary computation and different types of data to enhance model inference. Integration is performed at different levels. (i) An analysis of cross-platform integration of time series microarray data, discussing the effects on the resulting models and exploring cross-platform
normalisation techniques, is presented. This shows that time-course data integration is possible, and results in models more robust to noise and parameter perturbation, as
well as reduced noise over-fitting. (ii) Other types of measurements and knowledge, such as knock-out experiments, annotated transcription factors, binding site affinities and promoter sequences are integrated within the evolutionary framework to obtain more plausible GRN models. This is performed by customising initialisation, mutation and evaluation of candidate model solutions. The different data types are investigated and both qualitative and
quantitative improvements are obtained. Results suggest that caution is needed in order to obtain improved models from combined data, and the case study presented here provides
an example of how this can be achieved. Furthermore, (iii), RNA-seq data is studied in comparison to microarray experiments, to identify overlapping features and possibilities of integration within the framework. The extension of the framework to this data type is
straightforward and qualitative improvements are obtained when combining predicted interactions
from single-channel and RNA-seq datasets.
Egalitarianism in the rank aggregation problem: a new dimension for democracy
Winner selection by majority, in an election between two candidates, is the
only rule compatible with democratic principles. Instead, when the candidates
are three or more and the voters rank candidates in order of preference, there
are no univocal criteria for the selection of the winning (consensus) ranking
and the outcome is known to depend sensitively on the adopted rule. Building upon
18th-century Condorcet theory, whose idea was to maximize total voter
satisfaction, we propose here the addition of a new basic principle (dimension)
to guide the selection: satisfaction should be distributed among voters as
equally as possible. With this new criterion we identify an optimal set of
rankings. They range from the Condorcet solution to the one which is the most
egalitarian with respect to the voters. We show that highly egalitarian
rankings have the important property of being more stable with respect to
fluctuations and that classical consensus rankings (Copeland, Tideman, Schulze)
often turn out to be non-optimal. The new dimension we have introduced
provides, when used together with that of Condorcet, a clear classification of
all the possible rankings. By increasing awareness in selecting a consensus
ranking our method may lead to social choices which are more egalitarian
compared to those achieved by presently available voting systems. Comment: 18 pages, 14 page appendix, RateIt Web Tool:
http://www.sapienzaapps.it/rateit.php, RankIt Android mobile application:
https://play.google.com/store/apps/details?id=sapienza.informatica.rankit.
Appears in Quality & Quantity, 10 Apr 2015, Online First
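The two criteria in this abstract (total voter satisfaction and how equally it is spread) can be illustrated with a small brute-force sketch. Everything here is an assumption made for illustration rather than the paper's definition: dissatisfaction is measured by Kendall tau distance, inequality by the population variance of those distances, and the "optimal set" is taken to be the Pareto front over the two quantities. The function names are hypothetical.

```python
from itertools import combinations, permutations
from statistics import mean, pvariance


def kendall_tau_distance(a, b):
    """Count candidate pairs that rankings a and b order differently."""
    pos_a = {c: i for i, c in enumerate(a)}
    pos_b = {c: i for i, c in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def score_rankings(votes):
    """For every possible ranking, return (ranking, mean distance, variance).

    A low mean corresponds to high total satisfaction; a low variance
    means satisfaction is spread more equally among voters (assumed measures).
    """
    candidates = sorted(votes[0])
    results = []
    for ranking in permutations(candidates):
        distances = [kendall_tau_distance(ranking, vote) for vote in votes]
        results.append((ranking, mean(distances), pvariance(distances)))
    return results


def pareto_front(results):
    """Keep rankings that no other ranking beats on both mean and variance."""
    front = []
    for ranking, m, v in results:
        dominated = any(
            m2 <= m and v2 <= v and (m2 < m or v2 < v)
            for _, m2, v2 in results
        )
        if not dominated:
            front.append((ranking, m, v))
    return front
```

Enumerating all permutations is only feasible for a handful of candidates; the sketch is meant to show the trade-off between the two criteria, not to scale.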
A Big Data Analyzer for Large Trace Logs
The current generation of Internet-based services is typically hosted in large
data centers that take the form of warehouse-size structures housing tens of
thousands of servers. Continued availability of a modern data center is the
result of a complex orchestration among many internal and external actors
including computing hardware, multiple layers of intricate software, networking
and storage devices, electrical power and cooling plants. During the course of
their operation, many of these components produce large amounts of data in the
form of event and error logs that are essential not only for identifying and
resolving problems but also for improving data center efficiency and
management. Most of these activities would benefit significantly from data
analytics techniques to exploit hidden statistical patterns and correlations
that may be present in the data. The sheer volume of data to be analyzed makes
uncovering these correlations and patterns a challenging task. This paper
presents BiDAl, a prototype Java tool for log-data analysis that incorporates
several Big Data technologies in order to simplify the task of extracting
information from data traces produced by large clusters and server farms. BiDAl
provides the user with several analysis languages (SQL, R and Hadoop MapReduce)
and storage backends (HDFS and SQLite) that can be freely mixed and matched so
that a custom tool for a specific task can be easily constructed. BiDAl has a
modular architecture so that it can be extended with other backends and
analysis languages in the future. In this paper we present the design of BiDAl
and describe our experience using it to analyze publicly-available traces from
Google data clusters, with the goal of building a realistic model of a complex
data center. Comment: 26 pages, 10 figures
Opinion dynamics with disagreement and modulated information
Opinion dynamics concerns social processes through which populations or
groups of individuals agree or disagree on specific issues. As such, modelling
opinion dynamics represents an important research area that has been
progressively acquiring relevance in many different domains. Existing
approaches have mostly represented opinions through discrete binary or
continuous variables by exploring a whole panoply of cases: e.g. independence,
noise, external effects, multiple issues. In most of these cases the crucial
ingredient is an attractive dynamics through which similar or similar enough
agents get closer. Only rarely has the possibility of explicit disagreement
been taken into account (i.e., the possibility for a repulsive interaction
among individuals' opinions), and mostly for discrete or 1-dimensional
opinions, through the introduction of additional model parameters. Here we
introduce a new model of opinion formation, which focuses on the interplay
between the possibility of explicit disagreement, modulated in a
self-consistent way by the existing opinions' overlaps between the interacting
individuals, and the effect of external information on the system. Opinions are
modelled as a vector of continuous variables related to multiple possible
choices for an issue. Information can be modulated to account for promoting
multiple possible choices. Numerical results show that extreme information
results in segregation and has a limited effect on the population, while milder
messages have better success and a cohesion effect. Additionally, the initial
condition plays an important role, with the population forming one or multiple
clusters based on the initial average similarity between individuals, with a
transition point depending on the number of opinion choices.
Data integration for microarrays: enhanced inference for gene regulatory networks
Microarray technologies have been the basis of numerous important findings regarding gene expression in recent decades. Studies have generated large amounts of data describing various processes, which, due to the existence of public databases, are widely available for further analysis. Given their lower cost and higher maturity compared to newer sequencing technologies, these data continue to be produced, even though data quality has been the subject of some debate. However, given the large volume of data generated, integration can help overcome some issues related e.g. to noise or reduced time resolution, while providing additional insight on features not directly addressed by sequencing methods. Here we present an integration test case based on public Drosophila melanogaster datasets (gene expression, binding site affinities, known interactions). Using an evolutionary computation framework, we show how integration can enhance the ability to recover transcriptional gene regulatory networks from these data, as well as indicating which data types are more important for quantitative and qualitative network inference. Our results show a clear improvement in performance when multiple data sets are integrated, indicating that microarray data will remain a valuable and viable resource for some time to come.
Algorithmic bias amplifies opinion polarization: A bounded confidence model
The flow of information reaching us via the online media platforms is
optimized not by the information content or relevance but by popularity and
proximity to the target. This is typically performed in order to maximise
platform usage. As a side effect, this introduces an algorithmic bias that is
believed to enhance polarization of the societal debate. To study this
phenomenon, we modify the well-known continuous opinion dynamics model of
bounded confidence in order to account for the algorithmic bias and investigate
its consequences. In the simplest version of the original model the pairs of
discussion participants are chosen at random and their opinions get closer to
each other if they are within a fixed tolerance level. We modify the selection
rule of the discussion partners: there is an enhanced probability to choose
individuals whose opinions are already close to each other, thus mimicking the
behavior of online media which suggest interaction with similar peers. As a
result we observe: a) an increased tendency towards polarization, which emerges
even in conditions where the original model would predict convergence, and b) a
dramatic slowing down of the speed at which convergence to the asymptotic
state is reached, which makes the system highly unstable. Polarization is
augmented by a fragmented initial population.
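The modified selection rule described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the abstract only says that close opinions are chosen with enhanced probability, so the power-law weight `distance ** (-gamma)`, the small floor on the distance, and all parameter names and defaults are choices made for the example. With `gamma=0` the weights are uniform and the sketch reduces to random pairing as in the original model.

```python
import random


def biased_bounded_confidence(n=100, epsilon=0.3, mu=0.5, gamma=1.0,
                              steps=20000, seed=0):
    """Simulate bounded-confidence opinion dynamics with biased partner choice.

    epsilon: tolerance; pairs further apart than this do not interact.
    mu: convergence rate; 0.5 moves both agents to their midpoint.
    gamma: strength of the (assumed) algorithmic bias; 0 means no bias.
    Returns the list of opinions after `steps` pairwise encounters.
    """
    rng = random.Random(seed)
    opinions = [rng.random() for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        others = [j for j in range(n) if j != i]
        # Closer opinions get larger weights; the floor avoids division by zero.
        weights = [
            max(abs(opinions[i] - opinions[j]), 1e-6) ** (-gamma)
            for j in others
        ]
        j = rng.choices(others, weights=weights, k=1)[0]
        # Standard bounded-confidence update: interact only within tolerance.
        if abs(opinions[i] - opinions[j]) < epsilon:
            shift = mu * (opinions[j] - opinions[i])
            opinions[i] += shift
            opinions[j] -= shift
    return opinions
```

Running the function for several values of `gamma` and inspecting a histogram of the returned opinions is one way to explore how the bias affects clustering.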